In this project, we examine the growth in the use of the term “diversity”. To do this, we drew from the MEDLINE database in Web of Science, using the search term “TS=(diversity)” for 1990-2017 and restricting results to human research. This search returned 71,528 results, which we extracted using the bibliometrix package in R. Next, we converted the abstracts of these articles to a text corpus and used tidytext, a package designed for computational text analysis in R, to analyze patterns within the abstracts. Below is the R Markdown file and replication code for these analyses.

Examining Growth of Published Studies

In this first chunk, we load our data and examine the overall growth of articles from our search query. As we can see, the scientific literature that uses the term diversity grows substantially, from about 500 articles in 1990 to over 5,000 in 2017.

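# loading the packages used throughout this document
library(tidyverse)  # read_csv, dplyr verbs, tidyr, stringr, ggplot2
library(tidytext)   # unnest_tokens, stop_words
library(plotly)     # ggplotly
library(stringi)    # stri_trim_both
# loading the .csv file 
text_data <- read_csv("historical_text_data.csv")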
# checking to see how the overall data looks 
by_year <- text_data %>% 
  filter(year != "2018") %>% # filtering 2018 articles because they seem to be incomplete
  group_by(year) %>% count(year, sort = TRUE) %>% ungroup()
by_year <- ggplot() + geom_line(aes(y = n, x = year), data = by_year, stat="identity") + 
  labs(title = "Growth in Diversity-Related Publications from 1990-2017",
       caption = "Data Source: Web of Science") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())
by_year <- ggplotly(by_year); by_year

Word Frequencies Over Time

Next, we want to look at word frequencies by year in the literature. This chunk of code counts how often each word occurs in the abstracts of our dataset. Note that we also remove some frequently occurring tokens that are not relevant to our analysis (single digits and publisher copyright boilerplate); removing them does not systematically alter our results.
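
As a minimal sketch of the tokenize-and-count step, the toy example below runs the same tidytext calls on two made-up abstracts. The `toy_data` tibble and its contents are hypothetical and stand in for the real file, which is not reproduced here.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-abstract dataset (not taken from historical_text_data.csv)
toy_data <- tibble(
  year = c(1990, 2017),
  abstract = c("Genetic diversity of the population was high.",
               "The study examined diversity and health in patients.")
)

toy_counts <- toy_data %>%
  unnest_tokens(word, abstract) %>%       # one lowercased word per row
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  count(word, sort = TRUE)                # frequency of each remaining word

toy_counts
```

Passing `by = "word"` explicitly avoids the "Joining, by" message that appears in the output further down.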

# tokenizing the abstract data into words 
abstract_data <- text_data %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words)
## Joining, by = "word"
# most frequent word count in abstracts 
abstract_data %>%
  count(word, sort = TRUE)
## # A tibble: 159,784 x 2
##    word          n
##    <chr>     <int>
##  1 diversity 77112
##  2 study     34721
##  3 patients  32787
##  4 human     32166
##  5 results   29482
##  6 1         27696
##  7 genetic   27652
##  8 health    25787
##  9 cell      25463
## 10 analysis  25431
## # ... with 159,774 more rows
# adding custom set of stopwords 
my_stopwords <- tibble(word = c(as.character(1:10), 
                                "rights", "reserved", "copyright", "elsevier"))
abstract_data <- abstract_data %>% anti_join(my_stopwords)
## Joining, by = "word"
# looking at word frequencies by year 
abstract_words <- abstract_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(word, sort = TRUE) %>% ungroup(); abstract_words
## # A tibble: 650,713 x 3
##     year word          n
##    <dbl> <chr>     <int>
##  1  2017 diversity  6399
##  2  2016 diversity  6154
##  3  2015 diversity  5875
##  4  2014 diversity  5129
##  5  2013 diversity  5036
##  6  2012 diversity  4693
##  7  2011 diversity  4055
##  8  2010 diversity  3794
##  9  2009 diversity  3336
## 10  2017 study      3324
## # ... with 650,703 more rows

Growth in Population-Specific Terms Over Time

Now, we want to look at how the most relevant words vary over time. We chose to include words like diversity, genetic, and population as well as racially and ethnically specific terms. As we see, the rise of diversity does not necessarily mean that the focus on race or ethnicity is growing in step with that term. This could mean that diversity is being used as a catch-all in the scientific literature (i.e. that the term has so many meanings that it can stand for almost anything) or that diversity is mostly used in fields like immunology or oncology. We will explore the latter possibility a bit more below.

diversity_terms <- abstract_words %>% 
  filter(year != "2018") %>%
  filter(word == "diversity" |  word == 'genetic' | word == "population" |
         word == "ethnic" | word == "racial" | word == 'race' | 
         word == 'caucasian' | word == 'african' | word == 'black') 
diversity_terms
## # A tibble: 252 x 3
##     year word          n
##    <dbl> <chr>     <int>
##  1  2017 diversity  6399
##  2  2016 diversity  6154
##  3  2015 diversity  5875
##  4  2014 diversity  5129
##  5  2013 diversity  5036
##  6  2012 diversity  4693
##  7  2011 diversity  4055
##  8  2010 diversity  3794
##  9  2009 diversity  3336
## 10  2008 diversity  2896
## # ... with 242 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = word),
                     data = diversity_terms, stat="identity") + 
  labs(title = "Growth in Diversity-Related Terms (1990-2017)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

interactive_graph <- ggplotly(word_graph); interactive_graph

While that was quite useful, we do not see much variation in how the race/ethnicity terms change over time. One possibility is that variation in those terms is “diluted” across the many different terms that researchers use. Thus, the next logical step is to collapse all of the population-specific terms into one category and compare that to other terminology over time. Lee (2009), Kramer (2019), and others (e.g. Panofsky and Bliss 2017) have found that biomedical researchers continue to use population-specific terminology to reinforce the notion of population differences in various biological markers. Below, we have collapsed several population-specific terms into one category based on a list of terms developed out of Kramer’s (2019) dissertation work. While we won’t claim that this is an exhaustive list of all the population terms that exist in the world, it is a fairly comprehensive list of over 2,100 terms (also see the text networks section below). Let’s take a look at how much the use of all these population-specific terms grew over time…
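
To make the recoding step easier to follow, here is a small sketch on toy data. It assumes a three-term list (`toy_terms`) in place of the real `population_terms.csv`, and uses the `zxz` placeholder convention that appears later in this document. The placeholders open and close the alternation, so every real term ends up between `|` separators.

```r
library(stringr)

# Hypothetical stand-in for the `term` column of population_terms.csv
toy_terms <- c("african", "asian", "caucasian")

# Same construction as the replication code: a case-insensitive,
# whole-word alternation over all terms
toy_pattern <- paste(c("\\b(?i)(zxz", toy_terms, "zxz)\\b"), collapse = "|")
toy_pattern

# Tokens that match any term are collapsed into one label;
# everything else keeps its original value
toy_words <- c("African", "diversity", "asian", "asians")
recoded <- ifelse(str_detect(toy_words, toy_pattern),
                  "population-specific", toy_words)
recoded
```

Because of the `\\b` word boundaries, a token such as "asians" is left alone unless the plural form is itself in the term list.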

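population_specific <- read_csv("population_terms.csv") 
# collapse all terms into one case-insensitive, whole-word alternation; the "zcx"/"zxc"
# placeholders only open and close the group so that every real term sits between "|"
population_specific <- paste(c("\\b(?i)(zcx", population_specific$term, "zxc)\\b"), collapse = "|")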

recoded_abstract_data <- abstract_data %>% 
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = population_specific), 
                               yes = "population-specific", no = word))

recoded_abstract_data %>% filter(recoded_word == "population-specific")
## # A tibble: 120,800 x 14
##       id author title publication  year department subject grant_informati~
##    <dbl> <chr>  <chr> <chr>       <dbl> <chr>      <chr>   <chr>           
##  1     4 DUBOI~ LONG~ AIDS (LOND~  1999 UNIVERSIT~ IMMUNO~ <NA>            
##  2     4 DUBOI~ LONG~ AIDS (LOND~  1999 UNIVERSIT~ IMMUNO~ <NA>            
##  3    15 OUBIN~ GENE~ VIRUS RESE~  1999 LABORATOR~ IMMUNO~ <NA>            
##  4    15 OUBIN~ GENE~ VIRUS RESE~  1999 LABORATOR~ IMMUNO~ <NA>            
##  5    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
##  6    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
##  7    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
##  8    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
##  9    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
## 10    16 MONTA~ THE ~ AIDS RESEA~  1999 LABORATOI~ GENETI~ <NA>            
## # ... with 120,790 more rows, and 6 more variables: keyword <chr>,
## #   pubmed_id <chr>, doi <chr>, country <chr>, word <chr>,
## #   recoded_word <chr>
recoded_abstract_words <- recoded_abstract_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(recoded_word, sort = TRUE) %>% ungroup(); recoded_abstract_words
## # A tibble: 642,447 x 3
##     year recoded_word            n
##    <dbl> <chr>               <int>
##  1  2016 population-specific  9308
##  2  2017 population-specific  9271
##  3  2015 population-specific  9208
##  4  2013 population-specific  7809
##  5  2014 population-specific  7733
##  6  2012 population-specific  7036
##  7  2011 population-specific  6484
##  8  2010 population-specific  6434
##  9  2017 diversity            6399
## 10  2016 diversity            6154
## # ... with 642,437 more rows
diversity_terms <- recoded_abstract_words %>% 
  filter(recoded_word == "diversity" | recoded_word == "genetic" | 
         recoded_word == "population" | recoded_word == "population-specific") 
diversity_terms
## # A tibble: 112 x 3
##     year recoded_word            n
##    <dbl> <chr>               <int>
##  1  2016 population-specific  9308
##  2  2017 population-specific  9271
##  3  2015 population-specific  9208
##  4  2013 population-specific  7809
##  5  2014 population-specific  7733
##  6  2012 population-specific  7036
##  7  2011 population-specific  6484
##  8  2010 population-specific  6434
##  9  2017 diversity            6399
## 10  2016 diversity            6154
## # ... with 102 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_word),
                     data = diversity_terms, stat="identity") + 
  labs(title = "Growth in Diversity-Related Terminology (1990-2017)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

interactive_graph <- ggplotly(word_graph); interactive_graph

Interesting! After we collapsed all of the terms together in the population-specific category, we see dramatic growth that goes well beyond the growth in diversity. Now, let’s see what the growth of these terms looks like when we break it down by general, national, continental, and ethnic groups across various geographies.
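
The category-specific patterns are all built the same way, so one possible simplification is a small helper that builds them from a single copy of the term table. The sketch below is illustrative only: `build_pattern` and `toy_terms` are hypothetical names, and the toy rows are not taken from the real file (which, per the filters used in this document, has `term`, `category`, and `sub_category` columns).

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for population_terms.csv
toy_terms <- tibble(
  term     = c("italian", "japanese", "africa", "asia", "yoruba"),
  category = c("national", "national", "continental", "continental", "africa")
)

# Hypothetical helper: wrap a vector of terms in the same
# placeholder-delimited alternation used in the replication code
build_pattern <- function(terms) {
  paste(c("\\b(?i)(zxz", terms, "zxz)\\b"), collapse = "|")
}

# One pattern per category, named by category
patterns <- lapply(split(toy_terms$term, toy_terms$category), build_pattern)

str_detect("Japanese", patterns$national)     # checks the national pattern
str_detect("Japanese", patterns$continental)  # checks the continental pattern
```

Reading the file once and splitting by category keeps the term list and the patterns in sync if the CSV changes.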

general_pop_terms <- read_csv("population_terms.csv") %>% filter(sub_category == "population_general")
us_specific_terms <- read_csv("population_terms.csv") %>% filter(sub_category == "us_specific")
continental_terms <- read_csv("population_terms.csv") %>% 
  filter(category == "continental" | category == "subcontinental")
ling_religious_terms <- read_csv("population_terms.csv") %>% filter(category == "linguistic_religious")
national_terms <- read_csv("population_terms.csv") %>% filter(category == "national")
south_american_ethnic_groups <- read_csv("population_terms.csv") %>% filter(category == "south_america")
african_ethnic_groups <- read_csv("population_terms.csv") %>% filter(category == "africa")
north_american_ethnic_groups <- read_csv("population_terms.csv") %>% filter(category == "north_america")
european_ethnic_groups <- read_csv("population_terms.csv") %>% filter(category == "europe")
asian_ethnic_groups <- read_csv("population_terms.csv") %>% filter(category == "asia")
all_ethnic_groups <- read_csv("population_terms.csv") %>% 
  filter(category == "south_america" | category == "africa" | 
           category == "asia" | category == "north_america" | category == "europe")
general_pop_terms <- paste(c("\\b(?i)(zxz", general_pop_terms$term, "zxz)\\b"), collapse = "|")
us_specific_terms <- paste(c("\\b(?i)(zxz", us_specific_terms$term, "zxz)\\b"), collapse = "|")
continental_terms <- paste(c("\\b(?i)(zxz", continental_terms$term, "zxz)\\b"), collapse = "|")
ling_religious_terms <- paste(c("\\b(?i)(zxz", ling_religious_terms$term, "zxz)\\b"), collapse = "|")
national_terms <- paste(c("\\b(?i)(zxz", national_terms$term, "zxz)\\b"), collapse = "|")
south_american_ethnic_groups <- paste(c("\\b(?i)(zxz", south_american_ethnic_groups$term, "zxz)\\b"), collapse = "|")
african_ethnic_groups <- paste(c("\\b(?i)(zxz", african_ethnic_groups$term, "zxz)\\b"), collapse = "|")
north_american_ethnic_groups <- paste(c("\\b(?i)(zxz", north_american_ethnic_groups$term, "zxz)\\b"), collapse = "|")
european_ethnic_groups <- paste(c("\\b(?i)(zxz", european_ethnic_groups$term, "zxz)\\b"), collapse = "|")
asian_ethnic_groups <- paste(c("\\b(?i)(zxz", asian_ethnic_groups$term, "zxz)\\b"), collapse = "|")
all_ethnic_groups <- paste(c("\\b(?i)(zxz", all_ethnic_groups$term, "zxz)\\b"), collapse = "|")

recoded_abstract_data <- abstract_data %>% 
  
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = general_pop_terms), 
                               yes = "general population terms", no = word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = us_specific_terms), 
                               yes = "us-specific terms", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = continental_terms), 
                               yes = "continental terms", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = ling_religious_terms), 
                               yes = "linguistic & religious terms", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = national_terms), 
                               yes = "national terms", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = south_american_ethnic_groups), 
                               yes = "south american ethnic groups", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = african_ethnic_groups), 
                               yes = "african ethnic groups", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = north_american_ethnic_groups), 
                               yes = "north american ethnic groups", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = european_ethnic_groups), 
                               yes = "european ethnic groups", no = recoded_word)) %>%
  mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = asian_ethnic_groups), 
                               yes = "asian ethnic groups", no = recoded_word))  

recoded_abstract_words <- recoded_abstract_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(recoded_word, sort = TRUE) %>% ungroup(); recoded_abstract_words
## # A tibble: 642,695 x 3
##     year recoded_word       n
##    <dbl> <chr>          <int>
##  1  2017 diversity       6399
##  2  2016 diversity       6154
##  3  2015 diversity       5875
##  4  2014 diversity       5129
##  5  2013 diversity       5036
##  6  2012 diversity       4693
##  7  2015 national terms  4406
##  8  2016 national terms  4358
##  9  2017 national terms  4291
## 10  2011 diversity       4055
## # ... with 642,685 more rows
diversity_terms <- recoded_abstract_words %>% 
  filter(recoded_word == "diversity" | recoded_word == "population" | recoded_word == "population-specific" |
         recoded_word == "general population terms" | recoded_word == "continental terms" | recoded_word == "linguistic & religious terms" | 
         recoded_word == "national terms" | recoded_word == "south american ethnic groups" | recoded_word == "african ethnic groups" |
         recoded_word == "north american ethnic groups" | recoded_word == "european ethnic groups" | recoded_word == "asian ethnic groups" |
         recoded_word == "us-specific terms") 
diversity_terms
## # A tibble: 332 x 3
##     year recoded_word       n
##    <dbl> <chr>          <int>
##  1  2017 diversity       6399
##  2  2016 diversity       6154
##  3  2015 diversity       5875
##  4  2014 diversity       5129
##  5  2013 diversity       5036
##  6  2012 diversity       4693
##  7  2015 national terms  4406
##  8  2016 national terms  4358
##  9  2017 national terms  4291
## 10  2011 diversity       4055
## # ... with 322 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_word),
                     data = diversity_terms, stat="identity") + 
  labs(title = "Growth in Diversity-Related Terminology (1990-2017)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

interactive_graph <- ggplotly(word_graph); interactive_graph

As we can see here, most of the growth in population-specific terms comes from more frequent mentions of nationalities (e.g. Italian or Japanese), continental terms (e.g. Africa, Asia, North America), and general population terms (e.g. race, ethnicity, caucasian, african, asian, etc.). We do not really see much growth in mentions of ethnic groups in any given continent, even when we lump all of the different ethnic groups into one category (not shown here).

Examining Diversity Across Different Geographies

The next step is to look at how the concept of “diversity” is used across the world. As the graph below demonstrates, the rise of “diversity” seems to occur mostly in the context of predominantly White, Westernized countries like the US, England, the Netherlands, Germany and Switzerland.

# here we are just converting everything to lower case 
abstract_data$country <- tolower(abstract_data$country)

# looking at word frequencies by year 
diversity_by_country <- abstract_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(word, country, sort = TRUE) %>% ungroup(); diversity_by_country
## # A tibble: 1,424,517 x 4
##     year word      country           n
##    <dbl> <chr>     <chr>         <int>
##  1  2017 diversity united states  2781
##  2  2016 diversity united states  2753
##  3  2015 diversity united states  2635
##  4  2014 diversity united states  2464
##  5  2013 diversity united states  2401
##  6  2012 diversity united states  2324
##  7  2017 diversity england        2261
##  8  2016 diversity england        1982
##  9  2011 diversity united states  1919
## 10  2015 diversity england        1894
## # ... with 1,424,507 more rows
diversity_by_country <- diversity_by_country %>% 
  filter(word == "diversity") 
diversity_by_country
## # A tibble: 1,001 x 4
##     year word      country           n
##    <dbl> <chr>     <chr>         <int>
##  1  2017 diversity united states  2781
##  2  2016 diversity united states  2753
##  3  2015 diversity united states  2635
##  4  2014 diversity united states  2464
##  5  2013 diversity united states  2401
##  6  2012 diversity united states  2324
##  7  2017 diversity england        2261
##  8  2016 diversity england        1982
##  9  2011 diversity united states  1919
## 10  2015 diversity england        1894
## # ... with 991 more rows
diversity_over_time <- ggplot() + geom_line(aes(y = n, x = year, colour = country),
                     data = diversity_by_country, stat="identity") +
  labs(title = "Growth in Diversity-Related Terminology (From 1990-2017, By Country)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

diversity_over_time <- ggplotly(diversity_over_time); diversity_over_time

Examining Diversity Across Different Academic Disciplines

Lastly, we wanted to look more into the growth of diversity-related terms by scientific subject matter. This snippet of code counts the words occurring in abstracts over time and then breaks those counts down by MEDLINE’s and Web of Science’s subject categories. We have opted to include only 13 of the 134 categories that could have been graphed here. Overall, we see that the rise of diversity is concentrated in genetics & heredity, biochemistry & molecular biology, microbiology, immunology, and infectious disease research. We do not see this same rise in the social sciences, though there admittedly is some overall growth in that domain.

# first we need to break apart all the subject categories for each paper 
text_data <- text_data %>% 
  separate(subject, into = paste("subject", 1:15, sep = "_"), sep = ";") %>%
  gather(value, subject, subject_1:subject_15, na.rm = TRUE) %>% select(-value)

# and remove the annoying parentheticals from some categories
text_data <- text_data %>% 
  separate(subject, into = c("subject", "void"), sep = "[(]") %>% select(-void)

# then we get rid of the extra white space and make everything lower case to standardize 
text_data$subject <- stri_trim_both(text_data$subject)
text_data$subject <- tolower(text_data$subject)

# this shows us we have 134 different subjects and we appear to have removed all the duplicates 
unique(text_data$subject)
##   [1] "biochemistry & molecular biology"             
##   [2] "genetics & heredity"                          
##   [3] "immunology"                                   
##   [4] "pharmacology & pharmacy"                      
##   [5] "biophysics"                                   
##   [6] "pediatrics"                                   
##   [7] "cardiovascular system & cardiology"           
##   [8] "microbiology"                                 
##   [9] "infectious diseases"                          
##  [10] "cell biology"                                 
##  [11] "evolutionary biology"                         
##  [12] "medical ethics"                               
##  [13] "nursing"                                      
##  [14] "anthropology"                                 
##  [15] "toxicology"                                   
##  [16] "hematology"                                   
##  [17] "psychiatry"                                   
##  [18] "psychology"                                   
##  [19] "geriatrics & gerontology"                     
##  [20] "ethnic studies"                               
##  [21] "oncology"                                     
##  [22] "zoology"                                      
##  [23] "health care sciences & services"              
##  [24] "pathology"                                    
##  [25] "behavioral sciences"                          
##  [26] "mathematics"                                  
##  [27] "cultural studies"                             
##  [28] "research & experimental medicine"             
##  [29] "education & educational research"             
##  [30] "medical informatics"                          
##  [31] "information science & library science"        
##  [32] "dentistry, oral surgery & medicine"           
##  [33] "neurosciences & neurology"                    
##  [34] "dermatology"                                  
##  [35] "physiology"                                   
##  [36] "fisheries"                                    
##  [37] "nutrition & dietetics"                        
##  [38] "environmental sciences & ecology"             
##  [39] "computer science"                             
##  [40] "demography"                                   
##  [41] "entomology"                                   
##  [42] "gastroenterology & hepatology"                
##  [43] "general & internal medicine"                  
##  [44] "parasitology"                                 
##  [45] "otorhinolaryngology"                          
##  [46] "respiratory system"                           
##  [47] "virology"                                     
##  [48] "communication"                                
##  [49] "public, environmental & occupational health"  
##  [50] "meteorology & atmospheric sciences"           
##  [51] "orthopedics"                                  
##  [52] "medical laboratory technology"                
##  [53] "business & economics"                         
##  [54] "history"                                      
##  [55] "surgery"                                      
##  [56] "sociology"                                    
##  [57] "anatomy & morphology"                         
##  [58] "ophthalmology"                                
##  [59] "agriculture"                                  
##  [60] "urology & nephrology"                         
##  [61] "legal medicine"                               
##  [62] "food science & technology"                    
##  [63] "biotechnology & applied microbiology"         
##  [64] "religion"                                     
##  [65] "criminology & penology"                       
##  [66] "endocrinology & metabolism"                   
##  [67] "philosophy"                                   
##  [68] "developmental biology"                        
##  [69] "archaeology"                                  
##  [70] "audiology & speech-language pathology"        
##  [71] "rheumatology"                                 
##  [72] "anesthesiology"                               
##  [73] "government & law"                             
##  [74] "allergy"                                      
##  [75] "materials science"                            
##  [76] "social issues"                                
##  [77] "microscopy"                                   
##  [78] "obstetrics & gynecology"                      
##  [79] "substance abuse"                              
##  [80] "reproductive biology"                         
##  [81] "chemistry"                                    
##  [82] "radiology, nuclear medicine & medical imaging"
##  [83] "integrative & complementary medicine"         
##  [84] "biodiversity & conservation"                  
##  [85] "veterinary sciences"                          
##  [86] "transplantation"                              
##  [87] "imaging science & photographic technology"    
##  [88] "plant sciences"                               
##  [89] "nuclear science & technology"                 
##  [90] "social sciences - other topics"               
##  [91] "family studies"                               
##  [92] "mycology"                                     
##  [93] "life sciences & biomedicine - other topics"   
##  [94] "acoustics"                                    
##  [95] "international relations"                      
##  [96] "physics"                                      
##  [97] "rehabilitation"                               
##  [98] "critical care medicine"                       
##  [99] "robotics"                                     
## [100] "engineering"                                  
## [101] "geography"                                    
## [102] "emergency medicine"                           
## [103] "tropical medicine"                            
## [104] "music"                                        
## [105] "electrochemistry"                             
## [106] "energy & fuels"                               
## [107] "architecture"                                 
## [108] "automation & control systems"                 
## [109] "women&apos"                                   
## [110] "art"                                          
## [111] "sport sciences"                               
## [112] "paleontology"                                 
## [113] "science & technology - other topics"          
## [114] "astronomy & astrophysics"                     
## [115] "linguistics"                                  
## [116] "urban studies"                                
## [117] "film, radio & television"                     
## [118] "telecommunications"                           
## [119] "forestry"                                     
## [120] "mining & mineral processing"                  
## [121] "optics"                                       
## [122] "arts & humanities - other topics"             
## [123] "history & philosophy of science"              
## [124] "thermodynamics"                               
## [125] "literature"                                   
## [126] "s studies"                                    
## [127] "marine & freshwater biology"                  
## [128] "social work"                                  
## [129] "geology"                                      
## [130] "operations research & management science"     
## [131] "theater"                                      
## [132] "oceanography"                                 
## [133] "metallurgy & metallurgical engineering"       
## [134] "water resources"
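
Entries 109 ("women&apos") and 126 ("s studies") in the listing above look like two halves of a single category that was split at the `;` inside an HTML apostrophe entity. A minimal sketch of one way to avoid this, assuming the raw subject field encodes the apostrophe as `&apos;`, is to decode the entity before splitting. The `raw_subject` string is hypothetical and is not taken from the data.

```r
library(stringr)

# Hypothetical raw subject string, assuming the apostrophe arrives as "&apos;"
raw_subject <- "Sociology; Women&apos;s Studies; Psychology"

# Decode the entity first so its trailing ";" is not treated as a separator
decoded <- str_replace_all(raw_subject, fixed("&apos;"), "'")

# Then split on ";", trim white space, and lower-case as in the code above
subjects <- tolower(str_trim(str_split(decoded, ";")[[1]]))
subjects
```

Applying this to the real data would presumably change the number of unique subjects, so the counts reported in this document would need to be re-run rather than assumed.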
# now we can see how often these words arise by subject 
subject_data <- text_data %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words)

growth_by_subject <- subject_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(word, subject, sort = TRUE) %>% ungroup()  

subject_data %>% 
  count(subject, sort = TRUE)
## # A tibble: 134 x 2
##    subject                                n
##    <chr>                              <int>
##  1 genetics & heredity              4213638
##  2 biochemistry & molecular biology 4016585
##  3 microbiology                     2216691
##  4 immunology                       1986325
##  5 infectious diseases              1857374
##  6 pharmacology & pharmacy          1466127
##  7 behavioral sciences              1464061
##  8 psychology                       1356911
##  9 pediatrics                       1344591
## 10 cell biology                     1321172
## # ... with 124 more rows
graph_by_subject <- growth_by_subject %>% 
  filter(word == "diversity") %>% 
  filter(subject == "genetics & heredity" | subject == "biochemistry & molecular biology" | 
           subject == "microbiology" | subject == "infectious diseases" | subject == "immunology" | 
           subject == "pharmacology & pharmacy" | subject == "behavioral sciences" |
           subject == "health care sciences & services" | subject == "neurosciences & neurology" |
           subject == "psychology" | subject == "sociology" | 
           subject == "oncology" | subject == "business & economics"
         )

graph_by_subject <- ggplot() + geom_line(aes(y = n, x = year, colour = subject),
                     data = graph_by_subject, stat="identity") +
  labs(title = "Growth in Diversity-Related Terminology (From 1990-2017), By Subject") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

graph_by_subject <- ggplotly(graph_by_subject); graph_by_subject

Instead of selecting particular subjects, we also wanted to know what would happen if we just collapsed the categories into three buckets: (1) genetics & heredity (i.e. evolution, plant and animal sciences), (2) biomedical studies, and (3) the social and behavioral sciences. As a general side note, when one of the original categories did not fit into these three recoded categories (e.g. mathematics or computer science), it was recoded as “other” and ignored in our final analysis. For a full list of how the original 134 categories were recoded, you can visit this link.

genetics_heredity <- read_csv("subject_categories.csv") %>% filter(collapsed_subject == "genetics_heredity")
biomedical_studies <- read_csv("subject_categories.csv") %>% filter(collapsed_subject == "biomedical_studies")
social_behavioral <- read_csv("subject_categories.csv") %>% filter(collapsed_subject == "social_behavioral")
other_studies <- read_csv("subject_categories.csv") %>% filter(collapsed_subject == "other")
genetics_heredity <- paste(c("\\b(?i)(zxz", genetics_heredity$original_subject, "zxz)\\b"), collapse = "|")
biomedical_studies <- paste(c("\\b(?i)(zxz", biomedical_studies$original_subject, "zxz)\\b"), collapse = "|")
social_behavioral <- paste(c("\\b(?i)(zxz", social_behavioral$original_subject, "zxz)\\b"), collapse = "|")
other_studies <- paste(c("\\b(?i)(zxz", other_studies$original_subject, "zxz)\\b"), collapse = "|")

recoded_subject_data <- subject_data %>% 
  select(subject, word, year) %>%
  mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = genetics_heredity), 
                               yes = "genetics & heredity", no = subject)) %>%
  mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = biomedical_studies), 
                               yes = "biomedical studies", no = recoded_subject)) %>%
  mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = social_behavioral), 
                               yes = "social & behavioral sciences", no = recoded_subject)) %>% 
  mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = other_studies), 
                               yes = "other subjects", no = recoded_subject))

growth_by_subject <- recoded_subject_data %>%
  filter(year != "2018") %>%
  group_by(year) %>% 
  count(word, recoded_subject, sort = TRUE) %>% ungroup()

graph_by_subject <- growth_by_subject %>% 
  filter(word == "diversity") 

graph_by_subject <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_subject),
                     data = graph_by_subject, stat="identity") +
  labs(title = "Growth in Diversity-Related Terminology (From 1990-2017), By Subject") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

interactive_by_subject <- ggplotly(graph_by_subject); interactive_by_subject

This may partly be a by-product of more categories being biomedically related, but this analysis suggests that diversity is growing faster in biomedical studies than in the social & behavioral sciences and genetics & heredity-related studies.
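
One way to probe the by-product concern is to compare rates instead of raw counts, for example occurrences of "diversity" per 1,000 words within each recoded subject and year. The sketch below uses made-up numbers (`toy_counts` is hypothetical and does not reflect our results) and only illustrates the calculation.

```r
library(dplyr)

# Hypothetical word counts by year and recoded subject (not real results)
toy_counts <- tibble(
  year            = c(1990, 1990, 1990, 1990, 2017, 2017, 2017, 2017),
  recoded_subject = rep(c("biomedical studies", "biomedical studies",
                          "social & behavioral sciences",
                          "social & behavioral sciences"), 2),
  word            = rep(c("diversity", "patients"), 4),
  n               = c(10, 990, 5, 495, 100, 4900, 20, 980)
)

diversity_share <- toy_counts %>%
  group_by(year, recoded_subject) %>%
  mutate(total_words = sum(n)) %>%   # all words in that subject-year
  ungroup() %>%
  filter(word == "diversity") %>%
  mutate(per_1000 = 1000 * n / total_words)

diversity_share
```

In this toy case the raw counts differ a lot between the two groups while the per-1,000 rates are identical, which is the kind of pattern that would point to category size as the explanation.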

Conclusion

Overall, this document shows a rise in the use of “diversity” across scientific research. We see a roughly 10-fold increase between 1990 and 2017, which mostly occurs in research deriving from Westernized biomedical scientific contexts. Our future analyses will examine what implications this has for the use of diversity in and outside of that domain.

References

Kramer, B. L. (2019). Molecularization at the intersections: testosterone, prostate cancer and the construction of racial difference. Doctoral dissertation, Rutgers University-School of Graduate Studies.

Lee, C. (2009). “Race” and “ethnicity” in biomedical research: how do scientists construct and explain differences in health? Social Science & Medicine, 68(6), 1183-1190.

Panofsky, A., & Bliss, C. (2017). Ambiguity and scientific authority: population classification in genomic science. American Sociological Review, 82(1), 59-87.